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Abstract 

Convolutional Neural Networks (CNNs) can be shifted across 2D images or 
3D videos to segment them. They have a fixed input size and typically perceive 
only small local contexts of the pixels to be classified as foreground or background. 
In contrast, Multi-Dimensional Recurrent NNs (MD-RNNs) can perceive the en¬ 
tire spatio-temporal context of each pixel in a few sweeps through all pixels, es¬ 
pecially when the RNN is a Long Short-Term Memory (LSTM). Despite these 
theoretical advantages, however, unlike CNNs, previous MD-LSTM variants were 
hard to parallelize on GPUs. Here we re-arrange the traditional cuboid order of 
computations in MD-LSTM in pyramidal fashion. The resulting PyraMiD-LSTM 
is easy to parallelize, especially for 3D data such as stacks of brain slice images. 
PyraMiD-LSTM achieved best known pixel-wise brain image segmentation results 
on MRBrainS13 (and competitive results on EM-ISBI12). 


1 Introduction 

Long Short-Term Memory (LSTM) networks w are recurrent neural networks (RNNs) 
initially designed for sequence processing. They achieved state-of-the-art results on 
challenging tasks such as handwriting recognition Q, large vocabulary speech recog¬ 
nition 00 and machine translation Q. Their architecture contains gates to store and 
read out information from linear units called error carousels that retain information 
over long time intervals, which is hard for traditional RNNs. 
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Multi-Dimensional LSTM networks (MD-LSTM Q) connect hidden LSTM units 
in grid-like fashiorQ Two dimensional MD-LSTM is applicable to image segmenta¬ 
tion Sill) where each pixel is assigned to a class such as background or foreground. 
Each LSTM unit sees a pixel and receives input from neighboring LSTM units, thus 
recursively gathering information about all other pixels in the image. 

There are many biomedical 3D volumetric data sources, such as computed tomog¬ 
raphy (CT), magnetic resonance (MR), and electron microscopy (EM). Most previ¬ 
ous approaches process each 2D slice separately, using image segmentation algorithms 
such as snakes fTO) , random forests | [TT| and Convolutional Neural Networks IT^ . 3D- 
LSTM, however, can process the full context of each pixel in such a volume through 
8 sweeps over all pixels by 8 different LSTMs, each sweep in the general direction of 
one of the 8 directed volume diagonals. 

Due to the sequential nature of RNNs, however, MD-LSTM parallelization was dif- 
hcult, especially for volumetric data. The novel Pyramidal Multi-Dimensional LSTM 
(PyraMiD-LSTM) networks introduced in this paper use a rather different topology 
and update strategy. They are easier to parallelize, need fewer computations overall, 
and scale well on GPU architectures. 

PyraMiD-LSTM is applied to two challenging tasks involving segmentation of bi¬ 
ological volumetric images. Competitive results are achieved on EM-ISBI12 p3) ; best 
known results are achieved on MRBrainS13 OD- 


2 Method 

We will hrst describe standard one-dimensional LSTM Q and MD-LSTM. Then we 
introduce topology changes to construct the PyraMiD-LSTM, which is formally de¬ 
scribed and discussed. 

The original LSTM unit consists of an input gate (i), forget gat^(/), output gate 
(o), and memory cell (c) which control what should be remembered or forgotten over 
potentially long periods of time. All gates and activations are real-valued vectors: 
X, i, f, c, c,o,h G R^, where T is the length of the input. The gates and activations at 
discrete time t (f=l,2,...) are computed as follows: 


it = (j{xt 

• 6^i + ht-i 

^hi ^ihias ) i 

(1) 

ft = a{xt ■ 

^xf T ht-i • 

^hf H” ^fbias^j 

(2) 

Ct = tanh(xt 

■ 0XC + ht-i 

^hc ^cbias^'i 

(3) 


Ct = CtG 

' H + Ct-i © ft: 

(4) 

b 

II 

o 

Sxo + ht-i ■ 

^ho ^obias^: 

(5) 


ht = 

= Ot 0 tanh(Q) 

(6) 


where (•) is a matrix multiplication, (©) an element-wise multiplication, and 0 denotes 
the weights, c is the input to the ’cell’ c, which is gated by the input gate, and h 

*For example, in two dimensions this yields 4 directions; up, down, left and right. 

^Although the forget gate output is inverted and actually ‘remembers’ when it is on, and forgets when it 
is off, the traditional nomenclature is kept. 
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is the output. The non-linear functions a and tanh are applied element-wise, where 
= i+l-^ ■ Equations (1, 2) determine gate activations, Equation (3) cell inputs. 
Equation (4) the new cell states (here ‘memories’ are stored or forgotten). Equation (5) 
output gate activations which appear in Equation (6), the hnal output. 

2.1 Pyramidal Connection Topology 






-► 

♦ 





1 




J 





1 

J 

J 


J 





1 



J 

J 





1 

J 



■ 




















































< 










■»V 

p 









P 

P 







< 

\ 

P 

$ 

■ 






p 

< 

P 

< 

P 







< 

p 

P 








p 


















(a) Standard MD-LSTM (b) Turned' MD-LSTM (c) PyraMiD LSTM 


Figure 1: The standard MD-LSTM topology (a) evaluates the context of each pixel recursively 
from neighboring pixel contexts along the axes, that is, pixels on a simplex can be processed in 
parallel. Turning this order by 45° (b) causes the simplex to become a plane (a column vector 
in the 2D case here). The resulting gaps are filled by adding extra connections, to process more 
than 2 elements of the context (c). 

In MD-LSTMs, connections are aligned with the grid axes. In 2D, these directions 
are up, down, left and right. A 2D-LSTM adds the pixel-wise outputs of 4 LSTMs, one 
scanning the image pixel by pixel from north-west to south-east, one from north-east to 
south-west, one from south-west to north-east, and one from south-east to north-west. 

If the connections are rotated by 45°, all inputs to all units come from either left, 
right, up, or down (left in case of Eigure[^b). This greatly facilitates parallelization, 
since all the elements of a whole grid row can be computed independently, which does 
not work for MD-LSTM simplexes, whose sizes vary. However, this introduces context 
gaps as in Eigure[^b. By adding an extra input, these gaps are filled as in Eigure[^c. 

A similar connection strategy has been previously used to speed up non-euclidian 
distance computations on surfaces | [T5| . There are however important differences: 

• We can exploit efficient GPU-based CUDA convolution operations, but in a way 
unlike what is done in CNNs, as will be explained below. 

• As a result of these operations, input filters that are bigger than the necessary 
3x3 filters arise naturally, creating overlapping contexts. Such redundancy 
turns out to be beneficial and is used in our experiments. 

• We apply several layers of complex processing with multi-channeled outputs and 
several state-variables for each pixel, instead of having a single value per pixel 
as in distance computations. 

• Our application is focused on volumetric data. 
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Figure 2: On the left we see the context scanned so far by one of the 8 LSTMs of a 3D-LSTM: 
a cube. In general, given d dimensions, 2“^ LSTMs are needed. On the right we see the context 
scanned so far by one of the 6 LSTMs of a 3D-PyraMiD-LSTM: a pyramid. In general, 2 x d 
LSTMs are needed. 


One of the striking differences between PyraMiD-LSTM and MD-LSTM is the 
shape of the scanned contexts. Each LSTM of an MD-LSTM scans rectangle-like 
contexts in 2D or cuboids in 3D. Each LSTM of a PyraMiD-LSTM scans triangles 
in 2D and pyramids in 3D (see Ligure|^. An MD-LSTM needs 8 LSTMs to scan a 
volume, while a PyraMiD-LSTM needs only 6, since it takes 8 cubes or 6 pyramids to 
fill a volume. Given dimension d, the number of LSTMs grows as 2'^ for an MD-LSTM 
{exponentially) and 2 x d for a PyraMiD-LSTM {linearly). 

2.2 PyraMiD-LSTM 



Input Data 



Fully 
Connect ed| 
Layer |tanh| 


Fully 

Connected 
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Figure 3: PyraMiD-LSTM network architecture. Randomly rotated and flipped inputs are sam¬ 
pled from random locations, then fed to six C-LSTMs over three axes. The outputs from all 
C-LSTMs are combined into one PyraMiD-LSTM layer and sent to the fully-connected layer. 
tank is used as a squashing function in the hidden layer. Several PyraMiD-LSTM layers can be 
applied. The last layer is fully-connected and uses a softmax function to compute probabilities 
for each class for each pixel. 


Here we explain the PyraMiD-LSTM network architecture for 3D volumes (see 
Ligure0. It consists of six LSTMs with RNN-tailored convolutions (C-LSTM), one for 
each direction, to create the full context of each pixel. Note that each of these C-LSTMs 
is a entire LSTM RNN, processing the entire volume in one direction. The directions 
T) are defined over the three axes {x, y, z): !? = {(•, •, 1), (•, •, —1), (•, 1, •), (•,—!, •)) 
( 1 ,-, 

Each C-LSTM performs computations in a plane moving in the defined direction. 
The symbol (•) in a dimension signifies that the plane is parallel to this axis, and a 
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1 or — 1 implies that the computation is moving along the positive or negative di¬ 
rection of that axis, respectively. The input is x G where W is the 

width, H the height, D the depth, and C the number of channels of the input, or hid¬ 
den units in the case of second- and higher layers. Similarly, we define the volumes 

i‘^, o‘^, h G T^i^xHxDxO^ where d G V is a direction and O is the num¬ 

ber of hidden units (per pixel). Since each direction needs a separate volume, we denote 
volumes with (•)‘^. 

To keep the equations readable, we omit the time index below; ht becomes h and 
ht-i becomes h-i. The time index t is bound to the axis along which computations are 
performed. For instance, for direction d = (•, •, 1), v‘^ refers to the plane orthogonal 
to axis z; i.e. Vx,y,z,c for x = l..X,y = 1..F, c = 1..C, and z = t. For a negative 
direction d = (•, •, — 1), the plane is the same but moves in the opposite direction; 
z = Z — t. Furthermore, refers to the previous plane, in this case Vx,y,z,c for 
X = l..X,y = l..Y,c = 1..C', z = t — 1. A special case is the first plane in each 
direction, which does not have a previous plane, hence we omit the corresponding 
computation. The following functions are defined for all planes and all directions; 

C-LSTM: 


id = a{x * oi, + 

(7) 

fd = a{x * 

(8) 

= tanh(x * -f 

(9) 

■’S 

O 

+ 

■ 

© 

II 

(10) 

od = a {x* * ei^ + 0iuas), 

(11) 

hd = 0^ Q tanh(c‘^), 

(12) 


where (*) is a RNN-tailored convolutiorj^ and h is the output of the layer. Note 
that in C-LSTMs the convolution operations propagate information ‘sideways’ over the 
axis of the data. This is very different from CNNs where the convolution operations 
propagate information upwards to the next layer. All biases are the same for all LSTM 
units (i.e., no positional biases are used). The outputs h'^ for all directions are added, 
combining all C-LSTMs into one PyraMiD-LSTM; 

h=Y,h'^- ( 13 ) 

d^V 

Fully-Connected Layer: Each PyraMiD-LSTM layer is connected to a fully- 
connected layer, and the output is squashed by the hyperbolic tangent (tank) function. 
At the end, a softmax function is applied after the last fully-connected layer. 

3 Experiments 

We evaluate our approach on two 3D biomedical image segmentation datasets: electron 
microscopy (EM) and MR Brain images. 

^In 3D volumes, convolutions are performed in 2D; in general an n-D volume requires n-l-D convolu¬ 
tions. All convolutions have stride 1, and their filter sizes should at least be 3 x 3 in each dimension to create 
the full context. 
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EM dataset The EM dataset | [T3j is provided by the ISBI 2012 workshop on Seg¬ 
mentation of Neuronal Structures in EM Stacks ID- Two stacks consist of 30 slices 
of 512 X 512 pixels obtained from a 2 x 2 x 1.5 fim^ microcube with a resolution of 
4 X 4 X 50 nm^/pixel and binary labels. One stack is used for training, the other for 
testing. Target data consists of binary labels (membrane and non-membrane). 

MR Brain dataset The MR Brain images are provided by the ISBI 2015 workshop 
on Neonatal and Adult MR Brain Image Segmentation (ISBI NEATBrainSlS) p4) . 
The dataset consists of twenty fully annotated high-field (3T) multi-sequences: 3D Tl- 
weighted scan (Tl), T1-weighted inversion recovery scan (IR), and fluid-attenuated 
inversion recovery scan (FLAIR). The dataset is divided into a training set with five 
volumes and a test set with fifteen volumes. All scans are bias-corrected and aligned. 
Each volume includes 48 slices with 240 x 240 pixels (3mm slice thickness). The slices 
are manually segmented through nine labels: cortical gray matter, basal ganglia, white 
matter, white matter lesions, cerebrospinal fluid in the extracerebral space, ventricles, 
cerebellum, brainstem, and background. Following the ISBI NEATBrainS 15 workshop 
procedure, all labels are grouped into four classes and background: 1) cortical gray 
matter and basal ganglia (GM), 2) white matter and white matter lesions (WM), 3) 
cerebrospinal fluid and ventricles (CSF), and 4) cerebellum and brainstem. Class 4) is 
ignored for the final evaluation as required. 

Sub-volumes and Augmentation The full dataset requires more than the 12 GB of 
memory provided by our GPU, hence we train and test on sub-volumes. We randomly 
pick a position in the full data and extract a smaller cube (see the details in Bootstrap¬ 
ping). This cube is possibly rotated at a random angle over some axis and can be flipped 
over any axis. For EM images, we rotate over the z-axis and flipped sub-volumes with 
50% chance along x, y, and z axes. For MR brain images, rotation is disabled; only 
flipping along the x direction is considered, since brains are (mostly) symmetric in this 
direction. 

During test-time, rotations and flipping are disabled and the results of all sub¬ 
volumes are stitched together using a Gaussian kernel, providing the final result. 

Pre-processing We normalize each input slice towards a mean of zero and variance 
of one, since the imaging methods sometimes yield large variability in contrast and 
brightness. We do not apply the complex pre-processing common in biomedical image 
segmentation ID- 

We apply simple pre-processing on the three datatypes of the MR Brain dataset, 
since they contain large brightness changes under the same label (even within one 
slice; see Figure]^. From all slices we subtract the Gaussian smoothed images (filter 
size: 31 x 31, cr = 5.0), then a Contrast-Limited Adaptive Histogram Equalization 
(CLAHE) p7) is applied to enhance the local contrast (tile size: 16 x 16, contrast limit: 
2.0). An example of the images after pre-processing is shown in Figure]^ The original 
and pre-processed images are all used, except the original IR images (Figure[5b]), which 
have high variability. 
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Training We apply RMS-prop |18 19 

a„ = pan -I- (1 — p)bn, where a,b G R 
epoch: 


with momentum. We dehne a 4- 6 to be 
^ The following equations hold for every 


E = {y*- yf, (14) 

MSE (15) 

G = • , — -, (lo) 

VM^ + e 

M ^G, (17) 

e = e-\irM, ( 18 ) 


where y* is the target, y is the output from the networks, E is the squared loss, MSE 
a running average of the variance of the gradient, is the element-wise squared gra¬ 
dient, G the normalized gradient, M the smoothed gradient, and 9 the weights. This 
results in normalized gradients of similar size for all weights, such that even weights 
with small gradients get updated. This also helps to deal with vanishing gradients pO) . 

( j—epoch 

^ V i starts 

at Xir « 10“^ and halves every 100 epochs asymptotically towards Xi^ = 10“®. Other 
hyper-parameters used are e = 10“®, pmse = 0.9, and pM = 0.9. 

Bootstrapping To speed up training, we run three learning procedures with increas¬ 
ing sub-volume sizes: hrst, 3000 epochs with size 64 x 64 x 8, then 2000 epochs 
with size 128 x 128 x 15. Finally, for the EM-dataset, we train 1000 epochs with size 
256 X 256 X 20, and for the MR Brain dataset 1000 epochs with size 240 x 240 x 25. 
After each epoch, the learning rate Xir is reset. 

Experimental Setup All experiments are performed on a desktop computer with an 
NVIDIA GTX TITAN X 12GB GPU. For GPU implementation, the NVIDIA CUDA 
Deep Neural Network library (cuDNN) pT| is used. On the MR brain dataset, training 
took around three days, and testing per volume took around 2 minutes. 

We use exactly the same hyper-parameters and architecture for both datasets. Our 
networks contain three PyraMiD-LSTM layers. The first PyraMiD-LSTM layer has 
16 hidden units followed by a fully-connected layer with 25 hidden units. In the next 
PyraMiD-LSTM layer, 32 hidden units are connected to a fully-connected layer with 
45 hidden units. In the last PyraMiD-LSTM layer, 64 hidden units are connected to the 
fully-connected output layer whose size equals the number of classes. 

The convolutional hlter size for all PyraMiD-LSTM layers is set to 7 x 7. The total 
number of weights is 10,751,549, and all weights are initialized according to a uniform 
distribution: 7f(—0.1,0.1). 

3.1 Neuronal Membrane Segmentation 

Evaluation metrics Three error metrics evaluate the following factors: 
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Table 1: Performance comparison on EM images. Some of the competing methods reported in 
the ISBI 2012 website are not yet published. Comparison details can be found under http: 
//brainiac2.mit.edu/isbi_challenge/leaders-board 


Group 

Rand Err. 

Warping Err.( x 10 ^) 

Pixel Err. 

Human 



0.002 

0.0053 

0.001 

Simple Thresholding 

0.450 

17.14 

0.225 

IDSIA 1 

12 


0.050 

0.420 

0.061 

DIVE 



0.048 

0.374 

0.058 

PyraMiD-LSTM 

0.047 

0.462 

0.062 

IDSIA-SCI 

0.0189 

0.617 

0.103 

DIVE-SCI 


0.0178 

0.307 

0.058 


• Rand error p2) : 1 - F-score of rand index, which measures similarity between 
two segmentations on the foreground. 

• Warping error p?) : topological disagreements (object splits and mergers) 

• Pixel error; 1 - F-score of pixel similarity 



(a) Input (b) PyraMiD-LSTM 


Figure 4: Segmentation results on EM dataset (slice 26) 

Results Membrane segmentation is evaluated through an online system provided by 
the ISBI 2012 organizers. Comparisons to other methods are reported in Table [T] 
The teams IDSIA and DIVE provide membrane probability maps for each pixel, like 
our method. These maps are adapted by the post-processing technique of the teams 
SCI p^ , which directly optimizes the rand error (DIVE-SCI (top-1) and IDSIA-SCI 
(top-2)); this is most important in this particular segmentation task. Without post¬ 
processing, PyraMiD-LSTM networks outperform other methods in rand error, and 
are competitive in wrapping and pixel errors. Of course, performance could be fur¬ 
ther improved by applying post-processing techniques. Eigure shows an example 
segmentation result. 
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Table 2: The performance comparison on MR brain images. 


Structure 

GM 

WM 

CES 



DC 

MD 

AVD 

DC 

MD 

AVD 

DC 

MD 

AVD 

Rank 


(%) 

(mm) 

(%) 

(%) 

(mm) 

(%) 

(%) 

(mm) 

(%) 


BIGR2 

84.65 

1.88 

6.14 

88.42 

2.36 

6.02 

78.31 

3.19 

22.8 

6 

KSOM GHME 

84.12 

1.92 

5.44 

87.96 

2.49 

6.59 

82.10 

2.71 

12.8 

5 

MNAB2 

84.50 

1.69 

7.10 

88.04 

2.12 

7.73 

82.30 

2.27 

8.73 

4 

ISI-Neonatology 

85.77 

1.62 

6.62 

88.66 

2.06 

6.96 

81.08 

2.66 

9.77 

3 

UNC-IDEA 

84.36 

1.62 

7.04 

88.69 

2.06 

6.46 

82.81 

2.35 

10.5 

2 

PyraMiD-LSTM 

84.82 

1.69 

6.77 

88.33 

2.07 

7.05 

83.72 

2.14 

7.10 

1 


3.2 MR Brain Segmentation 

Evaluation metrics The results are compared based on the following three measures: 

• The DICE overlap (DC) | [25| : spatial overlap between the segmented volume and 
ground truth 

• The modihed Hausdorff distance (MD) p6| : 95th-percentile Hausdorff distance 
between the segmented volume and ground truth 

• The absolute volume difference (AVD) | [Z7l : the absolute difference between 
segmented and ground truth volume, normalized over the whole volume. 


Results MR brain image segmentation results are evaluated by the ISBI NEATBrainlS 
organizers 1 14 1 who provided the extensive comparison to other approaches on http : 

/ /mrbrainslS .isi.uu.nl/results. php. Table compares our results to 
those of the top hve teams. The organizers compute nine measures in total and rank all 
teams for each of them separately. These ranks are then summed per team, determining 
the final ranking (ties are broken using the standard deviation). PyraMiD-LSTM leads 
the final ranking with a new state-of-the-art result and outperforms other methods for 
CES in all metrics. 

We also tried regularization through dropout | [28) . Eollowing earlier work p9| , 
the dropout operator is applied only to non-recurrent connections (50% dropout on 
fully connected layers and/or 20% on input layer). However, this did not improve 
performance, but made the system slower. 


4 Conclusion 


Since 2011, GPU-trained max-pooling CNNs have dominated classihcation contests 130 


31 1^1 and segmentation contests p^. MD-LSTM, however, may pose a serious chal¬ 


lenge to such CNNs, at least for segmentation tasks. Unlike CNNs, MD-LSTM has an 
elegant recursive way of taking each pixel’s entire spatio-temporal context into account. 
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(a)Tl (b)IR (c) FLAIR 



(d) T1 (pre-processed) (e) IR (pre-processed) (f) FLAIR (pre-processed) 



(g) segmentation result from PyraMiD-LSTM 


Figure 5: Slice 19 of the test image 1. (a)-(c) are examples of three scan methods used in the 
MR brain dataset, and (d)-(f) show the corresponding images after our pre-processing procedure 
(see pre-processing in Section ). Input (b) is omitted due to strong artifacts in the data — the 
other datatypes are all used as input to the PyraMiD-LSTM. The segmentation result is shown in 
(g). 
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in both images and videos. Previous MD-LSTM implementations, however, could 
not exploit the parallelism of modern GPU hardware. This has changed through our 
work presented here. Although our novel highly parallel PyraMiD-LSTM has already 
achieved state-of-the-art segmentation results in challenging benchmarks, we feel we 
have only scratched the surface of what will become possible with such PyraMiD- 
LSTM and other MD-RNNs. 
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